Project 4. Wrangle and Analyze Data - WeRateDogs

TABLE OF CONTENTS

I. Introduction

II. Gathering Data

III. Assessing Data

IV. Cleaning Data

V. Visualizing Data

I. Introduction

The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

II. Gathering data

I will gather data from the following sources:

1. Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets provided by Udacity: twitter_archive_enhanced.csv

2. Data via the Twitter API

I tried to gather data via the Twitter (X) API. I went through all the necessary steps (see the commented code below). I learned along the way that, due to changed conditions for the free Twitter (X) developer account, I would not be able to use the API: the Search Tweets feature is not available at the Free access level. More information: https://developer.twitter.com/en/docs/twitter-api

In this case, I decided to continue the project using the Udacity JSON file provided for those who choose to access the Twitter data without actually creating a Twitter account. However, I would not be able to keep track of which tweets provided by Udacity have since been deleted (as that would require API access). All the coding steps leading me to this decision are kept in the Jupyter notebook in the commented sections.

3. Additional data for the 3000 most recent tweets: tweet image predictions (image_predictions.tsv), downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

4. Udacity JSON file for retweet and favorite counts: tweet_json.txt, downloaded programmatically using the Requests library and the following URL: https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt
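The programmatic downloads in points 3 and 4 can be sketched with a small helper; this is a minimal sketch (the function name is my own) assuming only the Requests library:

```python
import requests

def download_file(url):
    """Download `url` with Requests and save it under the URL's final path segment."""
    file_name = url.split('/')[-1]
    response = requests.get(url)
    response.raise_for_status()          # fail loudly on a bad HTTP status
    with open(file_name, 'wb') as f:
        f.write(response.content)       # write the raw bytes to disk
    return file_name

# Usage (downloads image-predictions.tsv into the working directory):
# download_file('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
#               '599fd2ad_image-predictions/image-predictions.tsv')
```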

GATHER

First, we will import the Python libraries

Now, we will download the Twitter (X) archive CSV file provided by Udacity and read it into a dataframe.

We will use the URL provided by Udacity to download the data of the second dataset and read it into a dataframe programmatically.

We will save the data into a CSV file.

The Twitter (X) archive provided by Udacity does not contain all of the necessary data (retweet and favorite counts). I tried to gather it via the Twitter API:

(As confidential, the consumer_key, consumer_secret, access_token, and access_secret have been replaced here by generic strings).

Please see the commented codes below.

From what I have learned, the Twitter Free Developer Account access rules have changed recently: you cannot use the Search Tweets feature at the Free access level of the Twitter API. More information: https://developer.twitter.com/en/docs/twitter-api

In this case, we will use Udacity's offline resources, starting by commenting out the first part of the code containing the API key and tokens.

Now, we will read the JSON file provided by Udacity line by line to create a dataframe with tweet_id, favorite_count, retweet_count, retweet status, and url.
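A minimal sketch of that line-by-line parsing, using a two-line in-memory sample instead of the real tweet_json.txt (the field names, e.g. retweeted_status, follow the standard Twitter JSON schema and are assumptions here):

```python
import io
import json
import pandas as pd

# Made-up sample lines; in the notebook the same loop reads open('tweet_json.txt').
sample = io.StringIO(
    '{"id": 1, "favorite_count": 100, "retweet_count": 30}\n'
    '{"id": 2, "favorite_count": 250, "retweet_count": 80, "retweeted_status": {}}\n'
)

tweets = []
for line in sample:
    tweet = json.loads(line)            # each line is one complete JSON object
    tweets.append({
        'tweet_id': tweet['id'],
        'favorite_count': tweet['favorite_count'],
        'retweet_count': tweet['retweet_count'],
        'is_retweet': 'retweeted_status' in tweet,   # presence of the key marks a retweet
    })

tweet_data = pd.DataFrame(tweets)
```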

We will read the CSV file and store it in the tweet_data dataframe.

ASSESSING DATA

After gathering the three pieces of data, I will use two types of assessment:

  1. Visual assessment: each piece of gathered data is displayed in the Jupyter Notebook for visual assessment purposes. Once displayed, data will be additionally assessed in an external application (here: text editor).
  2. Programmatic assessment: pandas' functions and methods are used to assess the data.

VISUAL ASSESSMENT

Visual assessment of the dataframes:

x_archive:

  1. Instead of separate columns for each stage ('doggo', 'floofer', 'pupper', 'puppo'), there should be a single combined column 'Dog Type'.
  2. Some names in the name column are missing, and some appear to be incorrect (for example, there is a name 'a', which would be rather unusual as a dog's name).

predictions:

  1. The data for prediction 1, prediction 2, and prediction 3 are spread out over several columns. There should be one column with the prediction number and additional columns with the actual prediction, the confidence, and whether the prediction is a dog breed.
  2. Columns p1, p2, and p3: the capitalization needs to be unified.
  3. To check: there are predictions (like paper-towel or spatula) which are not dogs. Are there records where all three predictions are not dogs? If yes, they should probably be ignored in this analysis.
  4. To check: whether the dashes and underscores in the prediction columns would cause any problems, or should stay the way they are.

tweet_data:

  1. The data looks clean.

PROGRAMMATIC ASSESSMENT:

x_archive

We will start with info() to find potential issues like null values or incorrect data types.

To keep in mind for the Assessment Summary:

  1. The timestamp column has Dtype 'object' while it should be datetime.
  2. The dog type should rather be a categorical Dtype.
  3. x_archive contains retweets, which should be deleted: first, we focus on the original content from WeRateDogs; second, they might confuse the results (the same urls).

There are retweets in x_archive which most probably have the same url, which can cause confusion. We will need to correct this later.

The above result confirms that the values are in the same rows

We have a column with expanded urls. I needed to check (via a Google search) what an expanded url is in order to decide whether the column should stay in the final dataframe. An expanded URL is the full, original URL that a shortened URL or a redirection link resolves to. It reveals the actual destination of these shortened links, so that you know exactly where you are being directed. For safety and transparency of the user's web browsing experience, this column is important. Let's have a closer look at the urls and the content they relate to by sampling the column randomly.

I want to see the whole expanded url, so first I will set the display options. Then, I will use sample() to randomly display examples of expanded_urls.
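A sketch of those two steps, using a tiny stand-in frame in place of the real x_archive:

```python
import pandas as pd

# Show full (untruncated) cell contents when displaying the column.
pd.set_option('display.max_colwidth', None)

# Illustrative stand-in; the notebook uses the real x_archive dataframe here.
x_archive = pd.DataFrame({'expanded_urls': [
    'https://twitter.com/dog_rates/status/1/photo/1',
    'https://twitter.com/dog_rates/status/2/video/1',
    'https://twitter.com/dog_rates/status/3/photo/1',
]})

# Draw a small random sample of urls (random_state fixed for repeatability).
sampled = x_archive['expanded_urls'].sample(2, random_state=1)
print(sampled)
```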

It looks like these urls mostly relate to photo/video content. I am curious, however, whether there are links related to different content. Let's check.

I am not sure what exactly these URLs relate to. Let's check how many such results we have in the column to decide whether they should concern us in this project.

The urls related to something other than photo or video make up 6% of the total, which is relatively small. For this project, we will keep this column as it is; we will just remove rows with null values for clarity. Let's check how many null values are in the column.

We will need to remove the rows with missing values, though.

As inconsistencies were noticed during visual assessment, we will check the common values in the x_archive 'name' column.

For the Assessment Summary: 'None' should be replaced by NaN for clarity (even if there is a small chance that people name their dog 'None' ;)).
Names should start with an upper-case letter, so let's check whether there are values in the 'name' column starting with a lower-case letter.
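That lowercase check can be sketched like this, on a made-up name Series standing in for x_archive['name']:

```python
import pandas as pd

# Illustrative sample; the real column has ~2000 values.
names = pd.Series(['Charlie', 'a', 'None', 'quite', 'Lucy'])

# Values whose first character is lower case are almost certainly not real names.
lowercase_names = names[names.str.match(r'^[a-z]')]
print(lowercase_names.tolist())
```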

For the Assessment Summary: this finding could be investigated more deeply, but for the needs of this project, it is safe to conclude that these values are not valid names.

We will check if there are duplicated values in x_archive

We will check for tendencies and shape of dataset distribution.

2. Predictions assessment:

We will start with info(), then check for duplicated values and examine the data distribution using describe().

No null values; the data types are correct.

The most likely explanation for the duplicated jpg_url values is that these are retweets. Let's check the number of retweets among the duplicated urls.

For the Assessment Summary: as assumed initially, the duplicated jpg_url should be removed from the predictions table

The result looks normal

I assume that at least one of the three predictions needs to be 'dog'; otherwise, the record has no value for this analysis.
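A sketch of that check, on a tiny stand-in predictions frame:

```python
import pandas as pd

# Stand-in frame; the notebook uses the real predictions dataframe.
predictions = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1_dog': [True, False, False],
    'p2_dog': [False, False, True],
    'p3_dog': [False, False, False],
})

# Rows where none of the three predictions is a dog carry no value for this analysis.
no_dog = predictions[~predictions[['p1_dog', 'p2_dog', 'p3_dog']].any(axis=1)]
print(len(no_dog))
```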

For Assessment Summary: we should probably remove these records

Based on the earlier visual assessment, I will have a closer look at the dashes and underscores in the p1, p2, and p3 columns.

For the Assessment Summary: it looks like the dashes are used in dog descriptions and underscores in dog breed names; I assume it is safe to keep the data as it is. There is no need to change the dashes to underscores in this case (the original form is even better: for example, if someone wanted to split the values into two columns, 'description' and 'breed', it would be easier based on the dashes and underscores).

3. tweet_data assessment

We will use info(), isnull(), duplicated() and describe() to check for inconsistencies.

The result looks good: no null values, and the data types are correct.

It looks like there are tweets that have 0 favorite_count and 0 retweet_count. Let's have a quick look at the correlation between favorite_count and retweet_count.

There is a positive correlation between favorite counts and retweet counts. As the number of favorites increases, the number of retweets also tends to increase.

To keep in mind for future analysis, once the data is cleaned and combined into a single dataframe: investigate the correlation between these two values further.

ASSESSMENT SUMMARY:

Tidiness:

x_archive

The columns doggo, floofer, pupper, and puppo are not tidy and should be combined into one column, 'dog type'. Invalid values should be removed.

predictions

The columns directly related to predictions (p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, p3_dog) are not tidy. They should be combined into the columns 'prediction', 'dog_status', and 'confidence'. After performing the above, a 'prediction_num' column should be added for clarity.

Quality:

x_archive

The timestamp should be in datetime format. The columns (doggo, floofer, pupper, and puppo) should have a categorical data type. Values in the dog names column should be corrected ("None", lowercase words, etc.).

predictions

The prediction confidence floats should be limited to two decimal places for clarity. Ensure that the new prediction_num column has the data type int. The first-letter capitalisation of values in p1, p2, and p3 should be corrected. Values in the dog predictions (p1, p2, p3) should have a categorical data type (we can change it after combining them into one column). The duplicated jpg_url rows related to retweets should be deleted.

tweet_data

There are 0 values in retweet_count and favorite_count. As the rows with 0 for favorite_count are related to retweets, I assume they will be deleted anyway in the process of cleaning the predictions of duplicated urls.

CLEANING DATA

We will complete the following items in Cleaning section:

Before we perform the cleaning, we will make a copy of the original data.

During cleaning, we will use the define-code-test framework and clearly document it.

We will merge individual pieces of data according to the rules of tidy data.

The result will be a high-quality and tidy master pandas DataFrame.

We will start by making a copy of each dataframe.

In the next steps, we will take care of retweets and duplicated jpg_urls (in other words, the urls of the images used for the predictions) to focus only on the original content from WeRateDogs. We also don't want any confusion caused by identical urls. We will:

As we assessed earlier, the duplicated jpg_urls are related to retweets, so we don't want them in predictions. We will query the rows in predictions_clean with a tweet_id present in archive_clean to remove retweets. At the same time, the duplicated values in the jpg_url column will be removed.
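That filtering step might look like this (tiny stand-in frames; in the notebook they are the real archive_clean and predictions_clean):

```python
import pandas as pd

# archive_clean stands in for the archive with retweets already removed.
archive_clean = pd.DataFrame({'tweet_id': [1, 3]})
predictions_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],                  # tweet 2 is a retweet of tweet 1
    'jpg_url': ['a.jpg', 'a.jpg', 'b.jpg'],
})

# Keep only predictions whose tweet_id survives in the retweet-free archive;
# this also drops the duplicated jpg_url rows introduced by retweets.
predictions_clean = predictions_clean[
    predictions_clean['tweet_id'].isin(archive_clean['tweet_id'])
].reset_index(drop=True)
```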

Now, we will:

Prepare the new column dog_status from p1_dog, p2_dog, and p3_dog with values True if status is 'dog' and False if status is not a dog.

The prediction column will be created from p1, p2, p3.

The corresponding prediction_num column with the prediction number (1, 2 or 3) will be created.

The dog_status True/False values and the values in the new prediction and prediction_num columns have to correspond with each other, so we will include mapping in the melting process.
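One way to sketch that melt-with-mapping, shown on a one-row stand-in frame: building one long piece per prediction number keeps prediction, confidence, and dog_status aligned with prediction_num.

```python
import pandas as pd

# One-row stand-in for the predictions frame.
predictions = pd.DataFrame({
    'tweet_id': [1],
    'p1': ['labrador_retriever'], 'p1_conf': [0.82], 'p1_dog': [True],
    'p2': ['spatula'],            'p2_conf': [0.10], 'p2_dog': [False],
    'p3': ['golden_retriever'],   'p3_conf': [0.05], 'p3_dog': [True],
})

# Stack one slice per prediction number so related values stay in the same row.
pieces = []
for num in (1, 2, 3):
    piece = predictions[['tweet_id', f'p{num}', f'p{num}_conf', f'p{num}_dog']].copy()
    piece.columns = ['tweet_id', 'prediction', 'confidence', 'dog_status']
    piece['prediction_num'] = num
    pieces.append(piece)

melted_df = (pd.concat(pieces, ignore_index=True)
               .sort_values(['tweet_id', 'prediction_num'])
               .reset_index(drop=True))
```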

If we wanted to work with a dataframe where tweet_ids are not duplicated, we could group by tweet_id and keep only the prediction with the highest confidence.

I decided to keep duplicated tweet_id as the records for them are not identical.

But if I decided to do the opposite, I would use the code commented out below, with the following steps:

  1. Remove duplicated jpg_url: drop duplicates based on jpg_url while keeping the first occurrence.
  2. Remove rows where none of the predictions are dogs: filter out rows where p1_dog, p2_dog, and p3_dog are all False.
  3. Group predictions and confidence scores: combine the p1, p2, and p3 columns into a single prediction column, and the p1_conf, p2_conf, and p3_conf columns into a single confidence column.
  4. Ensure no duplicated tweet_id: group by tweet_id and keep the prediction with the highest confidence.
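That alternative (one highest-confidence dog prediction per tweet) could be sketched like this on made-up data:

```python
import pandas as pd

# Stand-in long-format frame with two tweets and three predictions each.
melted_df = pd.DataFrame({
    'tweet_id':   [1, 1, 1, 2, 2, 2],
    'prediction': ['labrador_retriever', 'spatula', 'golden_retriever',
                   'chihuahua', 'pomeranian', 'paper_towel'],
    'confidence': [0.82, 0.10, 0.05, 0.40, 0.55, 0.03],
    'dog_status': [True, False, True, True, True, False],
})

# Keep only dog predictions, then the single highest-confidence row per tweet_id.
best = (melted_df[melted_df['dog_status']]
        .sort_values('confidence', ascending=False)
        .drop_duplicates(subset='tweet_id', keep='first')
        .sort_values('tweet_id'))
```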

We will continue with melted_df, a copy of which we will use as predictions_clean.

This is the WeRateDogs project, so we want to keep only the records with images of dogs. In order to do so, we need to remove the records in archive_clean whose tweet_id is not present in predictions_clean.

For further analysis, the data type of the values in column prediction_num should be integer.

In archive_clean, we will correct the values in the dog name column and combine the stage columns.

Remove the values starting with a lowercase letter, as these are not names.

Replace None with NaN in the 'name' column for clarity
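Both name fixes can be sketched together on a small sample Series (replacing the lowercase values with NaN rather than dropping the rows is one possible reading of "remove"):

```python
import numpy as np
import pandas as pd

# Stand-in frame; the notebook works on the real archive_clean.
archive_clean = pd.DataFrame({'name': ['Charlie', 'a', 'None', 'quite', 'Lucy']})

# Lowercase-starting values are articles/adjectives, not names: set them to NaN.
archive_clean.loc[archive_clean['name'].str.match(r'^[a-z]'), 'name'] = np.nan

# Replace the literal string 'None' with a real NaN for clarity.
archive_clean['name'] = archive_clean['name'].replace('None', np.nan)
```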

Some issues remain: for example, I am not certain whether naming a dog Churlie or Clybe is intentional or a misspelling; however, for this project, we will keep this column as it is. These are still names, so it should not affect the results of the analysis.

Now we will work on the archive columns doggo, floofer, pupper, and puppo. We aim for one column instead of four.

We will check for inconsistencies such as:

Duplicate dog type values within the same dog_types entry.

Mixed case (upper and lower case) stage values.

Multiple identical stages listed.

In order to do so we will: Create the dog_types column as previously described.

Define a function to check for inconsistencies in the dog_types column.

Apply this function to identify any rows with inconsistencies.

Nine rows have two values each. For clarity, we will replace records with double values ('doggo, pupper', 'doggo, puppo', 'doggo, floofer') with 'multiple'.
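The stage combination and the 'multiple' replacement can be sketched like this, on a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in archive with the four stage columns ('None' marks an unset stage).
archive_clean = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})
stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Join the non-'None' stage values per row into one comma-separated string.
archive_clean['dog_type'] = archive_clean[stages].apply(
    lambda row: ', '.join(v for v in row if v != 'None'), axis=1)
archive_clean['dog_type'] = archive_clean['dog_type'].replace('', np.nan)

# Rows that ended up with two stages (e.g. 'doggo, pupper') become 'multiple'.
archive_clean.loc[archive_clean['dog_type'].str.contains(',', na=False),
                  'dog_type'] = 'multiple'
archive_clean = archive_clean.drop(columns=stages)
```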

We will correct the data type of the Timestamp column to datetime.

I remember that, as advised by Udacity, the fact that the rating numerators are greater than the denominators does not need to be cleaned, and that this unique rating system is a big part of the popularity of WeRateDogs. However, out of pure curiosity, I would like to check how many values in the rating_numerator and rating_denominator columns have an atypical numerator/denominator format, in other words, which aren't within a reasonable range (numerator values typically range from 0 to 100, and denominator values are typically 10).

The number of rows with an atypical (or, mathematically speaking, invalid) numerator/denominator is very low: it makes up less than 1% of the whole dataframe.

I will drop the columns in_reply_to_status_id, in_reply_to_user_id, text, and expanded_urls, as they are not useful for this project.

We will check whether there are sources other than iPhone. If yes, we will keep this column for the analysis; if not, we will drop it.

There are clients who used the web or TweetDeck. This might be useful for further analysis, so we will keep it.

We will also rename the column dog_types to dog_type, as it is more adequate.

We will change the dog_type column's data type to category. It might be useful for model building and plotting in the analysis, and it leads to faster operations, particularly with functions like groupby (https://stackoverflow.com/questions/30601830/when-to-use-category-rather-than-object) (https://pandas.pydata.org/docs/user_guide/categorical.html)

Dataframe archive_clean is ready

Now, we will continue with tidying the predictions_clean dataframe. The upper/lower case inconsistencies in the prediction column should be resolved.

We will change all upper cases to lower cases in prediction column

We will round the confidence column to two decimal places for more clarity.

The prediction column should have a categorical data type.

The predictions_clean dataframe is ready.

3. tweet_data_clean

We will use info(), head() and describe() for the summary and data distribution of tweet_data_clean

The tweet_data_clean dataframe looks clean; the only issue we need to address is the 0 values in favorite counts. There are no retweets in archive_clean, so merging tweet_data_clean with archive_clean will solve the problem with 0 favorite counts. We will merge archive_clean with tweet_data_clean.

Then, we will merge the new dataframe with predictions_clean and store the result in our final master dataframe master_df

The tweet_data_clean data (stored in master_df) is clean and ready.

Now, we will prepare the final master dataframe by merging master_df with predictions_clean. Just in case, we will add code to ensure that there are only dogs in the dataframe (dog_status == True).
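The two merges and the dog_status guard might look like this on stand-in frames (the output file name twitter_archive_master.csv is my assumption):

```python
import pandas as pd

# Tiny stand-in frames for the three cleaned datasets.
archive_clean = pd.DataFrame({'tweet_id': [1, 2], 'dog_type': ['doggo', 'pupper']})
tweet_data_clean = pd.DataFrame({'tweet_id': [1, 2, 3],
                                 'favorite_count': [100, 250, 0],
                                 'retweet_count': [30, 80, 5]})
predictions_clean = pd.DataFrame({'tweet_id': [1, 2],
                                  'prediction': ['labrador_retriever', 'chihuahua'],
                                  'dog_status': [True, False]})

# Inner merges keep only tweet_ids present in all three frames.
master_df = (archive_clean
             .merge(tweet_data_clean, on='tweet_id', how='inner')
             .merge(predictions_clean, on='tweet_id', how='inner'))

# Just in case: keep only rows confirmed to be dogs, then store the result.
master_df = master_df[master_df['dog_status']].reset_index(drop=True)
master_df.to_csv('twitter_archive_master.csv', index=False)
```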

The final master dataframe is ready. We will store it in a master CSV file.

ANALYSING AND VISUALISING DATA

We will analyze and visualize the wrangled data.

We will produce at least three (3) insights and one (1) visualization. We will clearly document the piece of assessed and cleaned (if necessary) data used for each analysis and visualization.

1 INSIGHT: We want to know which dog breeds are the most popular based on the number of posts, interactions by Twitter users (favorite and retweet counts), and ratings.

In order to do so we will:

Calculate dog_rating as the ratio of rating_numerator to rating_denominator.

Filter the data to include only rows where dog_status is True.

Group by dog breed (prediction) to calculate the number of posts (num_posts), total interactions (total_interactions), average interactions (avg_interactions), and average rating (avg_rating).

To identify the 10 Most Popular Breeds in each category, we will sort the breeds by num_posts, total_interactions, and avg_rating.

We will create bar plots to visualise the top 10 dog breeds by the number of posts, total interactions, and average rating.
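The grouping step behind these plots can be sketched as follows, on a three-row stand-in master frame:

```python
import pandas as pd

# Stand-in data; the notebook uses the real master_df.
master_df = pd.DataFrame({
    'prediction': ['labrador_retriever', 'labrador_retriever', 'chihuahua'],
    'dog_status': [True, True, True],
    'favorite_count': [100, 300, 50],
    'retweet_count': [20, 80, 10],
    'rating_numerator': [12, 13, 11],
    'rating_denominator': [10, 10, 10],
})

dogs = master_df[master_df['dog_status']].copy()
dogs['dog_rating'] = dogs['rating_numerator'] / dogs['rating_denominator']
dogs['interactions'] = dogs['favorite_count'] + dogs['retweet_count']

# Aggregate per breed: post count, total/average interactions, average rating.
breed_stats = dogs.groupby('prediction').agg(
    num_posts=('interactions', 'size'),
    total_interactions=('interactions', 'sum'),
    avg_interactions=('interactions', 'mean'),
    avg_rating=('dog_rating', 'mean'),
)
top_by_posts = breed_stats.sort_values('num_posts', ascending=False).head(10)
```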

FOR THE SUMMARY:

  1. Top 10 Most Popular Dog Breeds by Number of Posts:

The Labrador Retriever and Golden Retriever are the most popular dog breeds by the number of posts, significantly outpacing other breeds.

Chihuahua, Pembroke, and Cardigan also have a substantial number of posts.

Pomeranian, Toy Poodle, Chow, French Bulldog, and Cocker Spaniel complete the top 10, indicating a diverse range of breeds being popular on Twitter.

  2. Top 10 Most Popular Dog Breeds by Total Interactions:

Labrador Retriever and Golden Retriever lead in total interactions, mirroring their popularity in the number of posts.

Pembroke and Chihuahua also rank high in total interactions.

Cardigan, Pomeranian, French Bulldog, Cocker Spaniel, Chesapeake Bay Retriever, and Chow make up the rest of the top 10. The engagement (favorite + retweet counts) follows a similar pattern to the number of posts, with popular breeds maintaining their high interaction rates.

  3. Top 10 Dog Breeds with Highest Average Rating:

Chow and Pomeranian have the highest average ratings, indicating that when these breeds are posted about, they tend to receive very positive ratings.

Samoyed and Saluki also have high average ratings, suggesting strong liking or approval from Twitter users.

Tibetan Mastiff, Entlebucher, Briard, Kerry Blue Terrier, Bouvier des Flandres, and Clumber round out the top 10, showing a mix of both common and rare breeds receiving high ratings.

This metric highlights breeds that might not have the highest post or interaction counts but are highly rated when they do appear.

Summary: Labrador Retriever and Golden Retriever dominate in both the number of posts and total interactions, indicating their overall popularity.

Chihuahua, Pembroke, and Cardigan also show significant presence and engagement.

The highest average ratings feature some breeds that are not necessarily the most posted or interacted with, suggesting that specific breeds generate strong positive sentiment among those who do post about them. These insights can be useful for understanding trends in dog breed popularity and engagement on Twitter.

3 INSIGHT: What is the correlation between favorite counts and retweet counts?

FOR THE SUMMARY

The correlation coefficient between favorite counts and retweet counts is 0.91.

The correlation coefficient is a measure of the strength and direction of the linear relationship between two variables. It ranges from -1 to 1: +1 indicates a perfect positive linear relationship. 0 indicates no linear relationship. -1 indicates a perfect negative linear relationship.

A correlation coefficient of 0.91 indicates a very strong positive linear relationship between favorite counts and retweet counts.

This means that tweets that receive a high number of favorites also tend to receive a high number of retweets, and vice versa.

In other words, as the number of favorite counts increases, the number of retweet counts also increases proportionally.

Scatter Plot Analysis:

The scatter plot visually confirms the strong positive correlation. (The points are closely clustered along an upward-sloping line, suggesting that higher favorite counts are associated with higher retweet counts).

There are a few outliers where tweets have an exceptionally high number of favorites or retweets, but the overall trend is clearly positive.

Conclusion The high correlation and the scatter plot together suggest that favorite counts and retweet counts are strongly related. When analyzing tweet popularity or engagement, both metrics are likely to move together, making either a good indicator of the other.

We want to explore the relationship further.

We will perform a linear regression analysis to model the relationship between favorite counts and retweet counts.

This will allow us to predict retweet counts based on favorite counts more precisely.

Building the model:

1. Prepare the Data:

We will extract the favorite_count as the independent variable (X) and retweet_count as the dependent variable (y).

2. Add a Constant:

We will add a constant term to X using sm.add_constant(X). This is necessary to include the intercept in the regression model.

3. Fit the Model:

We will use the OLS method from statsmodels to fit the regression model. We will call the fit method to estimate the model parameters.

4. Print the Model Summary:

The summary method will provide a detailed summary of the regression results, including the coefficients, R-squared value, p-values, and more.

5. Interpretation of the Regression Results: the summary output will include:

Coefficients: the intercept and slope of the regression line.

R-squared: the proportion of the variance in the dependent variable that is predictable from the independent variable.

P-values: the statistical significance of the coefficients.

Standard Errors: the accuracy of the coefficients' estimates.

Visualise the result in a scatter plot, but this time with the regression line.

FOR THE SUMMARY:

The results of the OLS regression analysis from the provided summary.

Key Points:

Dependent Variable: retweet_count

Independent Variable: favorite_count

R-squared: 0.830 This indicates that approximately 83% of the variance in retweet counts can be explained by the favorite counts. This is a very high value, suggesting a strong relationship.

Adjusted R-squared: 0.830 Similar to the R-squared, this value is adjusted for the number of predictors in the model. It is also very high, indicating a strong fit.

Coefficients:

Intercept (const): -368.9635 This is the predicted value of retweet_count when favorite_count is zero. It suggests that without any favorites, a tweet might have negative retweets, which isn't practical but can be interpreted as no baseline retweets.

Slope (favorite_count): 0.3505. For each additional favorite, the model predicts an increase of approximately 0.35 retweets. This coefficient is highly significant (p < 0.001).

Statistical Significance: both the intercept and the slope have p-values below 0.001, indicating that they are statistically significant at any common significance level (e.g., 0.05).

T-statistics: The t-values for both coefficients are very high, further confirming their significance.

Model Diagnostics:

F-statistic: 2.786e+04 with a p-value below 0.001. The model as a whole is statistically significant.

Durbin-Watson: 0.444. This statistic is used to detect autocorrelation; values closer to 2 are ideal, and a value of 0.444 suggests positive autocorrelation.

Condition Number: 1.89e+04. This would normally indicate potential multicollinearity issues; however, since we only have one predictor, it more likely indicates numerical stability issues.

Conclusion:

The regression model shows a very strong positive relationship between favorite counts and retweet counts.

For each additional favorite, retweets are expected to increase by about 0.35.

The model explains a significant portion of the variability in retweet counts (R-squared = 0.83).

Both the predictor and the overall model are statistically significant.

We could continue the analysis by extending this regression model with additional predictors, like the month or the dog breed. But for this project, we will stop here.

4 INSIGHT

We want to identify trends in dog ratings over time.

We will create a time series plot. This plot will show how the average rating has changed over the years.

We can also visualize the number of tweets over time to see if there are any patterns or trends in tweet activity.

In order to do the above we will create and interpret the visualisations:

First, we will set the timestamp as the index of the DataFrame.

Then, we will resample the data by month to get the average rating and the count of tweets.

Finally we will plot the average rating over time and the count of tweets over time.
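A sketch of the resampling steps, on a three-row stand-in frame ('MS' buckets by calendar month, labelled by month start):

```python
import pandas as pd

# Stand-in frame; the notebook uses the real timestamped master frame.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-01-05', '2016-01-20', '2016-02-10']),
    'dog_rating': [1.2, 1.0, 1.3],
})
df = df.set_index('timestamp')        # resampling needs a datetime index

# Monthly average rating and monthly tweet counts.
monthly_avg_rating = df['dog_rating'].resample('MS').mean()
monthly_tweet_count = df['dog_rating'].resample('MS').size()
```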

FOR THE SUMMARY:

The Average Dog Rating Over Time:

Overall Trend: The average dog rating shows a general stability with values hovering around 1.0 throughout most of the observed period.

There is a noticeable spike in the average rating around mid-2016, reaching above 3.0. This spike is an outlier compared to the rest of the data points. This could indicate a particular event or series of posts that received exceptionally high ratings during that time.

Following the spike, the average rating drops back to around 1.0 and shows slight fluctuations but generally remains stable. This indicates that the spike was a singular event rather than a sustained trend of higher ratings.

For the remainder of the period, from late 2016 to mid-2017, the average rating remains relatively stable, with minor increases and decreases but no other significant spikes.

Summary: The average dog rating is relatively stable over time, suggesting consistent user engagement and rating behavior for most of the period.

The significant spike (outlier) around mid-2016 could be investigated further to understand the cause, possibly a viral post or a highly popular dog breed that skewed the ratings temporarily.

The return to stability after the spike indicates that the high ratings were not a sustained trend but rather an anomaly.

5 INSIGHT:

Based on the above result, we want to check for seasonal patterns or trends in number of tweets by breaking down the data into smaller monthly time periods. We will prepare a visualisation in order to do so.

For future reference: it would be interesting to compare the average ratings with other metrics, like total interactions (favorites and retweets) or the number of posts, to see if there are any correlations.

FOR THE SUMMARY:

The Number of Tweets Over Time:

The plot shows a high number of tweets at the beginning of the observed period, particularly around early 2016, peaking at over 1000 tweets in a single month.

Following the peak, there is a sharp decline in the number of tweets over the next few months. By mid-2016, the number of tweets drops significantly to around 200 per month.

After mid-2016, the number of tweets stabilizes, fluctuating around 200 tweets per month with minor variations. There is no significant upward or downward trend after the initial decline, indicating a period of steady activity.

Towards the end of the observed period, specifically around mid-2017, there is another noticeable decline in the number of tweets, dropping to below 100 tweets per month.

Summary:

There was an initial surge in the number of tweets, peaking in early 2016. This could be due to increased popularity or a specific event driving higher engagement.

The sharp decline after the peak indicates that the initial surge was not sustained. This could be due to various factors such as changes in user interest, etc.

The stabilization period indicates a consistent level of activity with around 200 tweets per month. It might be suggesting a loyal user base or stable content production.

The decline towards the end of the period might indicate a drop in engagement or content production.

All these results could be investigated further. For example, we could:

  1. Analyze the content and context of tweets during the peak period to understand what drove the high volume of tweets.
  2. Look into the content and engagement strategies during the stabilization period to understand what kept engagement steady.
  3. Investigate the factors leading to the decline in tweets towards the end of the period to identify potential areas for improvement or changes in strategy.

However, in this project, we will not investigate it further.

6 INSIGHT: We want to know which dog types are the most popular in the dataset. To check, we will prepare a visualisation of the distribution of dog types.
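The distribution plot can be sketched with a plain pandas bar chart (the notebook may use seaborn's countplot instead, as listed in the references at the end):

```python
import matplotlib
matplotlib.use('Agg')                 # headless-safe backend for saving figures
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in dog_type values; the notebook uses master_df['dog_type'].
dog_types = pd.Series(['pupper', 'pupper', 'doggo', 'pupper', 'puppo'])

# value_counts() already sorts by frequency, so the bars come out ranked.
counts = dog_types.value_counts()
counts.plot(kind='bar', title='Distribution of dog types')
plt.xlabel('dog type')
plt.ylabel('count')
plt.tight_layout()
plt.savefig('dog_types.png')
```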

FOR THE SUMMARY:

Popularity of Dog Types: Dog type 'pupper' (1st place with count between 500 and 600) and 'doggo' (2nd place with count between 100 and 200) are significantly more common in the dataset, which could indicate either a bias in the data collection or genuine popularity among the Twitter users.

References:

  1. I used official documentation for Python, NumPy, pandas and other libraries listed in #import libraries section.
  2. https://seaborn.pydata.org/tutorial/color_palettes.html
  3. https://seaborn.pydata.org/generated/seaborn.countplot.html#seaborn.countplot
  4. https://docs.tweepy.org/en/v3.2.0/api.html#API
  5. https://developer.twitter.com/en/docs/twitter-api (to address issues with the API connection)
  6. https://stackoverflow.com/questions/18307551/regex-pattern-to-find-all-lowercase-words
  7. https://stackoverflow.com/questions/30601830/when-to-use-category-rather-than-object
  8. https://www.statology.org/python-guides/
  9. https://www.statsmodels.org/dev/regression.html